feat(xqwatcher): migrate from EC2 ASG to Kubernetes Deployment #4287

Merged
blarghmatey merged 46 commits into main from feat/xqwatcher-kubernetes-migration
Mar 24, 2026
Conversation

@blarghmatey

@blarghmatey blarghmatey commented Mar 11, 2026

Summary

Migrates xqueue-watcher infrastructure from EC2 Auto Scaling Groups with AppArmor/codejail sandboxing to a Kubernetes Deployment using container-based grading. This is the infrastructure companion to mitodl/xqueue-watcher#14 which implements the ContainerGrader backend.

Changes

src/ol_infrastructure/lib/ol_types.py

  • Added xqwatcher to both Services and Application enums for consistent K8s label generation.

src/ol_infrastructure/applications/xqwatcher/xqwatcher_server_policy.hcl

  • Added read access to secret-DEPLOYMENT/edx-xqueue so the grader handler config (stored in Vault) can embed the xqueue server URL and authentication password.

src/ol_infrastructure/applications/xqwatcher/__main__.py

Complete rewrite replacing EC2 resources with Kubernetes resources:

| Old (EC2) | New (K8s) |
| --- | --- |
| IAM instance profile + Vault AWS auth | OLEKSAuthBinding (IRSA + Vault K8s auth) |
| EC2 Launch Template + ASG | Kubernetes Deployment |
| AMI with codejail/AppArmor | mitodl/xqueue-watcher (DockerHub) container image |
| Consul config distribution | ConfigMap + OLVaultK8SSecret CRD |

New Kubernetes resources created:

  • OLEKSAuthBinding — IRSA role + Vault Kubernetes auth backend role
  • OLVaultK8SSecret — syncs grader handler config from Vault KV to a K8s Secret via Vault Secrets Operator
  • ConfigMap — base poll settings (xqwatcher.json) and stdout-only structured logging (logging.json)
  • Role + RoleBinding — grants xqwatcher pods permission to create/delete Jobs and read pod logs (required by ContainerGrader's Kubernetes backend)
  • Deployment — runs xqueue-watcher with non-root security context, resource limits, liveness probe, and topology spread for HA
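
As a rough illustration of the Role described above, the manifest could look like the following, written here as a plain Python dict rather than the stack's actual Pulumi code. The resource name `xqwatcher-grader` is hypothetical, and the verb list contains only what the description states (create/delete on Jobs, read on pod logs); the real Role may need more.

```python
# Hypothetical Role manifest. The name and the exact verb list are assumptions;
# the PR text only states create/delete on Jobs and reading pod logs.
def grader_role(namespace: str, name: str = "xqwatcher-grader") -> dict:
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": name, "namespace": namespace},
        "rules": [
            # Jobs belong to the "batch" API group.
            {
                "apiGroups": ["batch"],
                "resources": ["jobs"],
                "verbs": ["create", "delete"],
            },
            # Pod logs are the "pods/log" subresource of the core ("") group.
            {
                "apiGroups": [""],
                "resources": ["pods/log"],
                "verbs": ["get"],
            },
        ],
    }


role = grader_role("mitx-openedx")
```

A RoleBinding would then tie this Role to the xqwatcher service account in the same namespace.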

Stack configs (9 files)

Removed EC2-specific keys (consul:address, auto_scale, instance_type) and added K8s-specific keys:

  • xqwatcher:cluster — EKS cluster name
  • xqwatcher:namespace — Kubernetes namespace
  • xqwatcher:min_replicas / max_replicas
  • xqwatcher:docker_tag

Deployment Prerequisites

Before applying this stack:

  1. Build and push mitodl/xqueue-watcher image to DockerHub (from mitodl/xqueue-watcher#14)
  2. Build and push course grader images (e.g. from MITx/graders-mit-600x#10)
  3. Update Vault secret secret-xqwatcher/{env}-grader-config with confd_json containing a ContainerGrader handler config
  4. Ensure Vault Secrets Operator is installed in the target cluster

Related PRs

blarghmatey and others added 5 commits March 11, 2026 12:22
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add read access to secret-DEPLOYMENT/edx-xqueue so the xqwatcher
service can retrieve the xqueue server URL and authentication
password needed by the ContainerGrader handler config.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Completely rewrite the xqwatcher Pulumi stack to deploy on Kubernetes
instead of EC2 Auto Scaling Groups with AppArmor/codejail.

Changes:
- Replace IAM instance profile + Vault AWS auth with OLEKSAuthBinding
  (IRSA + Vault K8s auth backend)
- Add OLVaultK8SSecret to sync grader handler config from Vault KV
  to a Kubernetes Secret via the Vault Secrets Operator CRD
- Add a ConfigMap for base poll settings and structured JSON logging
  to stdout (no log rotation in containers)
- Add RBAC Role + RoleBinding granting the xqwatcher service account
  permission to create/delete Kubernetes Jobs and read pod logs,
  required by ContainerGrader's kubernetes backend
- Create a Kubernetes Deployment with:
  - ghcr.io/mitodl/xqueue-watcher image
  - Security context (non-root, drop ALL capabilities)
  - Resource requests + memory limit
  - Liveness probe via python -c import xqueue_watcher
  - Topology spread for HA across nodes
  - Vault grader config + base config mounted into /xqwatcher/conf.d/
- Preserve vault.kv.SecretV2 write so grader config remains managed
  in Pulumi
- Export k8s_deployment_name and k8s_namespace

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove EC2-specific settings (consul:address, auto_scale, instance_type)
and add Kubernetes-specific settings for all stacks:

- xqwatcher:cluster — EKS cluster name (residential or applications)
- xqwatcher:namespace — target Kubernetes namespace
- xqwatcher:min_replicas — minimum pod count (maps from auto_scale.desired)
- xqwatcher:max_replicas — maximum pod count (maps from auto_scale.max)
- xqwatcher:docker_tag — container image tag (default: latest)

Cluster assignments:
- mitx, mitx-staging → residential cluster
- mitxonline → applications cluster

Namespace assignments follow xqueue convention:
- mitx → mitx-openedx
- mitxonline → mitxonline-openedx
- mitx-staging → mitx-staging-openedx

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@blarghmatey blarghmatey requested a review from Copilot March 18, 2026 18:50
Copilot AI left a comment

Pull request overview

Migrates the xqueue-watcher (xqwatcher) infrastructure in ol-infrastructure from an EC2/ASG-based deployment to a Kubernetes Deployment on EKS, aligning with the ContainerGrader-based runtime introduced in the application repo.

Changes:

  • Adds xqwatcher to shared enum types to support consistent labeling.
  • Updates Vault policy to allow reading xqueue server credentials.
  • Replaces the xqwatcher EC2 stack with Kubernetes resources (Vault auth binding + VSO-synced secret + ConfigMap + RBAC + Deployment) and updates stack configs accordingly.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| src/ol_infrastructure/lib/ol_types.py | Adds xqwatcher to Services/Application enums for consistent labels. |
| src/ol_infrastructure/applications/xqwatcher/xqwatcher_server_policy.hcl | Extends Vault policy to read xqueue server secret path. |
| src/ol_infrastructure/applications/xqwatcher/__main__.py | Full rewrite: provisions Vault+IRSA binding, VSO secret sync, ConfigMap, RBAC, and a Deployment for xqueue-watcher. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitxonline.QA.yaml | Updates stack config to K8s-focused settings (cluster/namespace/replicas/docker tag). |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitxonline.Production.yaml | Same as above for Production. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitxonline.CI.yaml | Same as above for CI. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx.QA.yaml | Updates config for residential mitx QA. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx.Production.yaml | Updates config for residential mitx Production. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx.CI.yaml | Updates config for residential mitx CI. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx-staging.QA.yaml | Updates config for mitx-staging QA. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx-staging.Production.yaml | Updates config for mitx-staging Production. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx-staging.CI.yaml | Updates config for mitx-staging CI. |

blarghmatey and others added 3 commits March 18, 2026 15:35
- Add create_irsa_service_account flag to OLEKSAuthBinding to
  optionally create the K8s ServiceAccount with IRSA annotation;
  use it in xqwatcher to fix 'serviceaccount not found' pod error
- Add XQWATCHER_* env vars to Deployment matching env_settings.py;
  expose http_basic_auth from Vault-synced secret via VSO template
- Fix image reference from ghcr.io to mitodl/ (DockerHub)
- Change imagePullPolicy to Always for mutable 'latest' tag
- Rename XQWATCHER_DOCKER_DIGEST to XQWATCHER_DOCKER_TAG
- Remove unused network_stack StackReference
- Remove dead xqwatcher:target_vpc config key from all 9 stacks
- Remove unimplemented xqwatcher:max_replicas from all 9 stacks

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The manager CLI only accepts -d/--config_root; it auto-discovers
xqwatcher.json and logging.json from that directory. Remove the
non-existent --config and --logging-config flags.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
blarghmatey and others added 4 commits March 18, 2026 16:44
…pods

ContainerGrader calls k8s_config.load_incluster_config() which reads
the service account token from the projected volume at
/var/run/secrets/kubernetes.io/serviceaccount/token. The xqwatcher
ServiceAccount has automount_service_account_token=False (secure
default), so the PodSpec must explicitly opt in to have the token
mounted, otherwise all Kubernetes Job API calls will fail with a
ConfigException.
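
A minimal sketch of the opt-in, as a plain Python dict standing in for the PodSpec (not the stack's Pulumi code). The service account and container values are placeholders; `automountServiceAccountToken` is the standard PodSpec field, and when set on the pod it takes precedence over the ServiceAccount's own setting.

```python
# Sketch only: a PodSpec fragment that explicitly opts in to token mounting.
def pod_spec(service_account: str, containers: list) -> dict:
    return {
        "serviceAccountName": service_account,
        # Pod-level value overrides automountServiceAccountToken on the
        # ServiceAccount, so the token is mounted even if the SA disables it.
        "automountServiceAccountToken": True,
        "containers": containers,
    }


# Placeholder container entry for illustration.
spec = pod_spec("xqwatcher", [{"name": "xqwatcher", "image": "mitodl/xqueue-watcher"}])
```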

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…e tag

When the Concourse pipeline populates XQWATCHER_DOCKER_DIGEST, build
the image ref as mitodl/xqueue-watcher@sha256:... (immutable digest)
so Kubernetes always pulls exactly the image that was built and tested.
Fall back to :tag from stack config only when the digest is unavailable
(e.g. manual deploys). imagePullPolicy: Always is retained so new
digests are always pulled on rollout.
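
The selection logic described in this commit can be sketched as a small helper. The function name and signature are illustrative, not taken from the stack code; only the `repo@sha256:...` and `repo:tag` reference forms come from the text above.

```python
from typing import Optional


def build_image_ref(repository: str, digest: Optional[str], tag: str) -> str:
    """Prefer an immutable digest; fall back to a tag when none is provided."""
    if digest:
        # Digest references use "@", e.g. repo@sha256:<hex>.
        return f"{repository}@{digest}"
    # Tag references use ":", e.g. repo:latest.
    return f"{repository}:{tag}"


pinned = build_image_ref("mitodl/xqueue-watcher", "sha256:abc123", "latest")
fallback = build_image_ref("mitodl/xqueue-watcher", None, "latest")
```

Here `pinned` is the form a pipeline-driven deploy would produce, and `fallback` is what a manual deploy without the digest would get.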

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The uv virtualenv bin directory is not on PATH in the container, so
the 'xqueue-watcher' console script can't be found directly. Use
'uv run xqueue-watcher' to invoke it through uv's environment, which
correctly resolves the script installed in the project virtualenv.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
uv run without --no-sync attempts to sync the virtualenv at startup,
which fails in the container (no write access / network). Use
--no-sync to run the already-installed entrypoint as-is.
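
Put together with the earlier CLI fix, the container command might be assembled along these lines. This is a sketch: the helper name is hypothetical, and the config root is left as a parameter because its value changes in a later commit.

```python
# Hypothetical helper that assembles the container command described above.
def watcher_command(config_root: str) -> list:
    return [
        "uv", "run",
        "--no-sync",       # do not try to sync the virtualenv at startup
        "xqueue-watcher",  # console script resolved inside uv's environment
        "-d", config_root, # the only config flag the manager CLI accepts
    ]


command = watcher_command("/xqwatcher")
```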

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
blarghmatey and others added 11 commits March 18, 2026 17:22
configure_from_directory(path) reads xqwatcher.json and logging.json
directly from path, then globs path/conf.d/*.json for queue watcher
configs. We were passing -d /xqwatcher/conf.d and mounting everything
flat there, so the manager looked for watchers at
/xqwatcher/conf.d/conf.d/*.json (not found).

Fix: pass -d /xqwatcher and restructure mounts:
  /xqwatcher/xqwatcher.json      <- manager config (ConfigMap)
  /xqwatcher/logging.json        <- logging config (ConfigMap)
  /xqwatcher/conf.d/grader_config.json  <- queue watchers (Vault secret)
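
The lookup behaviour described above can be reproduced with a small stand-in. `discover_configs` is a hypothetical imitation of the described behaviour, not the real `configure_from_directory`; it only shows why pointing `-d` at the `conf.d` directory finds no watchers.

```python
import tempfile
from pathlib import Path


def discover_configs(config_root: Path) -> dict:
    """Imitate the described lookup: two fixed files plus conf.d/*.json."""
    return {
        "manager": config_root / "xqwatcher.json",
        "logging": config_root / "logging.json",
        "watchers": sorted((config_root / "conf.d").glob("*.json")),
    }


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "xqwatcher"
    (root / "conf.d").mkdir(parents=True)
    (root / "xqwatcher.json").write_text("{}")
    (root / "logging.json").write_text("{}")
    (root / "conf.d" / "grader_config.json").write_text("{}")

    # Correct: -d /xqwatcher
    good_watchers = [p.name for p in discover_configs(root)["watchers"]]
    # Broken: -d /xqwatcher/conf.d looks in conf.d/conf.d, which does not exist
    bad_watchers = [p.name for p in discover_configs(root / "conf.d")["watchers"]]
```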

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
VSO renders secret values via Go templates: {{ .Secrets.confd_json }}.
When confd_json is stored as a nested object, VSO renders a Go map
literal (map[...]) rather than valid JSON, causing a JSONDecodeError
at startup. Pre-serialize confd_json to a JSON string so the template
renders parseable JSON.
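
The pre-serialization step amounts to storing a JSON string instead of a nested object. A minimal sketch, with an illustrative queue name and config shape that are assumptions rather than the real secret contents:

```python
import json

# Illustrative queue config; the real contents live in the secret.
confd = {"example-queue": {"CONNECTIONS": 1, "HANDLERS": []}}

# Store the value as a JSON *string*. A template that emits this value
# verbatim then produces parseable JSON rather than a stringified map.
secret_data = {"confd_json": json.dumps(confd)}

# What the consumer would read back from the rendered file.
round_tripped = json.loads(secret_data["confd_json"])
```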

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…llback

Match the keycloak pattern: require the digest env var so the image is
always pinned to an immutable digest. Remove the mutable :latest tag
fallback that allowed manual pulumi-up runs to silently deploy an
uncontrolled image. Also remove the unused xqwatcher:docker_tag config
key from all stack YAML files.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…gh cache

When the SOPS secret's confd_json contains a ContainerGrader handler
whose KWARGS include an 'image' key, rewrite that value through
cached_image_uri() before writing to Vault. This means the SOPS secret
stores a plain DockerHub reference (e.g. mitodl/mit-600x-grader:latest)
and Pulumi transforms it to the ECR pull-through cache URI at deploy
time, keeping grading Jobs free from DockerHub rate limits.
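
A sketch of the rewrite pass, under stated assumptions: `cached_image_uri_stub` is a hypothetical stand-in for the repository's `cached_image_uri()` (whose output format is not shown in this PR), and the `HANDLERS`/`KWARGS` nesting follows the handler config shape the commit describes.

```python
import copy


def cached_image_uri_stub(image: str) -> str:
    # Hypothetical stand-in; the real helper returns an ECR pull-through URI.
    return f"cache.example.invalid/dockerhub/{image}"


def rewrite_grader_images(confd: dict, rewrite=cached_image_uri_stub) -> dict:
    """Return a copy of confd with every handler KWARGS['image'] rewritten."""
    out = copy.deepcopy(confd)  # leave the decrypted source config untouched
    for queue in out.values():
        for handler in queue.get("HANDLERS", []):
            kwargs = handler.get("KWARGS", {})
            if "image" in kwargs:
                kwargs["image"] = rewrite(kwargs["image"])
    return out


source = {
    "example-queue": {
        "HANDLERS": [
            {"HANDLER": "ContainerGrader",  # placeholder handler path
             "KWARGS": {"image": "mitodl/mit-600x-grader:latest"}},
        ]
    }
}
rewritten = rewrite_grader_images(source)
```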

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The CodeQL 'Analyze (actions)' job (exit code 32) fails because the
extractor finds .github/workflows/*.yml and .github/actions/**/*.yml
but cannot process any of them. This is a known extractor-level issue
with CodeQL 2.24.x on Erk agent workflow patterns.

Excluding .github from CodeQL's path analysis silences the fatal error
while leaving Python and JavaScript/TypeScript scans unaffected.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add src/ol_concourse/pipelines/open_edx/grader_images/ with three pipeline
definitions for building and publishing containerized course grader images
to private ECR.

base_image_pipeline.py:
  Builds grader_support/Dockerfile.base from the xqueue-watcher repo and
  pushes to both DockerHub (mitodl/xqueue-watcher-grader-base, public) and
  ECR (610119931565.dkr.ecr.us-east-1.amazonaws.com/mitodl/xqueue-watcher-
  grader-base, private). Triggered by changes to grader_support/ in the
  xqueue-watcher repo. The ECR push is the trigger source for downstream
  per-grader build pipelines.

build_pipeline.py:
  GraderPipelineConfig dataclass and grader_image_pipeline() factory for
  per-grader-repo build pipelines. Triggered by new commits to the grader
  repo OR a new base image digest in ECR. The Docker build receives
  GRADER_BASE_IMAGE=repo@sha256:... resolved at runtime via a sh wrapper
  around oci-build-task's build script (the only way to inject a
  file-derived BUILD_ARG in Concourse; params are static strings).
  Pushes to private ECR only. GRADER_PIPELINES list seeded with
  graders-mit-600x.

meta.py:
  Self-updating meta pipeline that creates and maintains the base image
  pipeline and one build pipeline per GRADER_PIPELINES entry. Triggered
  by changes to the grader_images/ pipeline code in ol-infrastructure.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ines

- base_image_pipeline: use chore/migrate-to-uv-and-k8s-container-grader
  branch of xqueue-watcher (where Dockerfile.base updates live)
- build_pipeline: track feat/containerized-grader for graders-mit-600x
- Fix E501 in both files: split long strings to stay within 88-char limit

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The CONTEXT was grader_support/ which caused the COPY grader_support/
instruction in Dockerfile.base to fail (no nested grader_support/ inside
the context). Use the repo root as CONTEXT so the COPY can locate the
directory relative to it.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…images

Add ensure_ecr_task() helper to ol_concourse/lib/containers.py (mirrors the
pattern used in the dagster docker_pulumi_pipeline). The task runs the AWS
CLI to check for the ECR repository and creates it if missing, so the first
pipeline run does not fail on a missing registry.

Apply to both grader image pipelines:
- base_image_pipeline: ensures mitodl/xqueue-watcher-grader-base exists
  before pushing to ECR
- build_pipeline: ensures the per-grader ECR repo (config.ecr_repo_name)
  exists before pushing the course grader image
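
The check-then-create step can be sketched as a shell one-liner built in Python. This is not the actual `ensure_ecr_task()` implementation; the helper name is hypothetical, and only the two AWS CLI subcommands (`describe-repositories`, `create-repository`) are assumed to be what the task runs.

```python
import shlex


def ensure_ecr_script(repo_name: str) -> str:
    """Build a shell snippet that creates the ECR repo only if it is missing."""
    repo = shlex.quote(repo_name)
    return (
        # describe-repositories exits non-zero when the repo does not exist...
        f"aws ecr describe-repositories --repository-names {repo} "
        # ...in which case the right-hand side creates it.
        f"|| aws ecr create-repository --repository-name {repo}"
    )


script = ensure_ecr_script("mitodl/xqueue-watcher-grader-base")
```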

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When ecr_region is set, the registry-image resource automatically
constructs the full ECR URI as {account}.dkr.ecr.{region}.amazonaws.com/{repository}.
Passing the full URI in image_repository caused the hostname to be doubled
in API calls, resulting in NAME_UNKNOWN errors.

- Remove ecr_image_uri property from GraderPipelineConfig
- Fix grader_base_ecr_repo default to use repo-name-only string
- Change registry_image(image_repository=config.ecr_image_uri) to
  registry_image(image_repository=config.ecr_repo_name)
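
The doubling is easy to see with a stand-in for the URI composition described above. `compose_ecr_uri` is illustrative only and mirrors the `{account}.dkr.ecr.{region}.amazonaws.com/{repository}` pattern from the commit message.

```python
def compose_ecr_uri(account: str, region: str, repository: str) -> str:
    """Mimic how the full URI is derived when ecr_region is set."""
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{repository}"


account, region = "610119931565", "us-east-1"
repo_name = "mitodl/xqueue-watcher-grader-base"

# Correct: pass only the repository name.
correct = compose_ecr_uri(account, region, repo_name)

# Bug: passing an already-complete URI repeats the registry hostname.
doubled = compose_ecr_uri(account, region, correct)
```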

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The grader-images-pipeline-code git resource was tracking 'main', but
the pipeline files don't exist on main yet. Switch to the feature branch
until this work is merged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
blarghmatey and others added 2 commits March 19, 2026 15:48
The graders-mit-600x repository is private. Switch the git resource from
an HTTPS git_repo to an ssh_git_repo so Concourse can clone it. The SSH
private key is read from Vault at ((github.ssh_private_key)).

- Import ssh_git_repo instead of git_repo
- Add github_private_key field to GraderPipelineConfig (defaults to
  ((github.ssh_private_key)))
- Update grader_repo_url in GRADER_PIPELINES to use SSH form
  (git@github.com:mitodl/graders-mit-600x)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
infrastructure/github has no generic SSH key. The correct key for
cloning private mitodl repos from the infrastructure Concourse team
is odlbot_private_ssh_key in infrastructure/open_api_clients.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
blarghmatey and others added 2 commits March 20, 2026 11:28
…onfig + SERVER_REF

Queue configs (CONNECTIONS, HANDLERS, ContainerGrader KWARGS) are now
stored as plaintext in Pulumi stack YAML files under xqwatcher:queues.
The xqueue server URL is stored under xqwatcher:xqueue_server_url.

SERVER_REF is injected at deploy time so xqueue-watcher resolves
credentials at runtime from xqueue_servers.json, which is mounted from
a Vault-synced Kubernetes Secret.  The secret is sourced from the same
secret-{env_prefix}/edx-xqueue Vault KV path already used by the xqueue
and edxapp deployments (xqwatcher_password field), eliminating the
separate xqwatcher-specific KV mount and SOPS secrets files.

Changes:
- __main__.py: remove SOPS read, vault.kv.SecretV2, vault_mount_stack
  StackReference, and XQWATCHER_HTTP_BASIC_AUTH env var; read queues
  config from Pulumi config; inject SERVER_REF into each queue entry;
  move grader_config.json into ConfigMap; add xqueue_servers.json
  Vault-synced secret from secret-{env_prefix}/edx-xqueue; update
  Deployment volumes/mounts accordingly
- xqwatcher_server_policy.hcl: remove secret-xqwatcher/* path
- All 9 stack YAML files: add xqueue_server_url and queues config

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add AWS_DEFAULT_REGION=us-east-1 to ensure_ecr_task params so the
  AWS CLI knows which region to use without relying on worker defaults
- Remove spurious service_account_name kwarg from OLVaultK8SResourcesConfig
  instantiation in OLEKSAuthBinding; the field does not exist on the
  model and the name is derived internally from application_name
- Fix liveness probe to use 'uv run --no-sync python' instead of bare
  'python', which would fail with ModuleNotFoundError because
  xqueue_watcher is only available inside the uv virtual environment

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI left a comment

Pull request overview

Migrates the xqueue-watcher infrastructure from an EC2/ASG deployment to Kubernetes, updating Vault access and stack configuration, and adding Concourse pipelines to build/publish grader container images used by the new ContainerGrader flow.

Changes:

  • Add xqwatcher to shared enums used for labeling.
  • Replace the xqwatcher stack’s EC2 resources with Kubernetes resources (Deployment, RBAC, ConfigMap, Vault Secrets Operator integration).
  • Add Concourse pipelines to build a grader base image and course-specific grader images, and update stack YAML configs for the new K8s-based deployment.

Reviewed changes

Copilot reviewed 19 out of 20 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| src/ol_infrastructure/lib/ol_types.py | Adds xqwatcher to enums used for consistent label generation. |
| src/ol_infrastructure/components/applications/eks.py | Extends OLEKSAuthBinding to optionally create IRSA ServiceAccount(s). |
| src/ol_infrastructure/applications/xqwatcher/xqwatcher_server_policy.hcl | Adjusts Vault policy to allow reading xqueue credentials from the shared secret path. |
| src/ol_infrastructure/applications/xqwatcher/__main__.py | Replaces EC2-based deployment with K8s Deployment + RBAC + ConfigMap + VSO-managed secrets. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitxonline.QA.yaml | Updates stack config from EC2 params to K8s params + queue definitions. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitxonline.Production.yaml | Updates stack config from EC2 params to K8s params + queue definitions. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitxonline.CI.yaml | Updates stack config from EC2 params to K8s params + queue definitions. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx.QA.yaml | Updates stack config from EC2 params to K8s params + queue definitions. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx.Production.yaml | Updates stack config from EC2 params to K8s params + queue definitions. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx.CI.yaml | Updates stack config from EC2 params to K8s params + queue definitions. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx-staging.QA.yaml | Updates stack config from EC2 params to K8s params + queue definitions. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx-staging.Production.yaml | Updates stack config from EC2 params to K8s params + queue definitions. |
| src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx-staging.CI.yaml | Updates stack config from EC2 params to K8s params + queue definitions. |
| src/ol_concourse/pipelines/open_edx/grader_images/meta.py | Adds a self-updating meta pipeline that creates/updates grader image pipelines. |
| src/ol_concourse/pipelines/open_edx/grader_images/build_pipeline.py | Adds reusable pipeline generator for course-specific grader images. |
| src/ol_concourse/pipelines/open_edx/grader_images/base_image_pipeline.py | Adds pipeline generator for building/publishing the shared grader base image. |
| src/ol_concourse/pipelines/open_edx/grader_images/__init__.py | Initializes the new grader_images pipeline package. |
| src/ol_concourse/lib/containers.py | Adds a reusable task step to ensure an ECR repository exists before pushing. |
| src/bridge/secrets/xqwatcher/secrets.mitx.ci.yaml | Updates encrypted xqwatcher grading configuration secrets for the new backend. |
| .github/codeql/codeql-config.yml | Adds CodeQL config to exclude .github from actions extraction failures. |

blarghmatey and others added 2 commits March 20, 2026 13:22
Replace the old Packer-based xqwatcher pipeline with a Docker+Pulumi
pipeline that mirrors the xqueue pattern:

- Watches mitodl/xqueue-watcher (main) for new commits
- Builds and pushes the Docker image to DockerHub as
  mitodl/xqueue-watcher:{release}
- Passes the built image digest as XQWATCHER_DOCKER_DIGEST to each
  Pulumi stack so the Deployment rolls to the exact image SHA

Update meta.py to generate docker-pulumi-xqwatcher-{release} pipelines
instead of the retired packer-pulumi ones.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Unpin grader_images meta pipeline from feature branch; track main
- Unpin xqueue-watcher base image source from dev branch; track main
- Unpin graders-mit-600x grader repo from feature branch; track main
- Fix base_image_pipeline.py docstring: downstream pipelines trigger off
  the DockerHub push, not the ECR push
- Add xqwatcher:docker_tag config fallback for XQWATCHER_DOCKER_DIGEST
  so pulumi up can run without the env var set (matches xqueue pattern)
- Remove env vars that duplicate xqwatcher.json ConfigMap values
  (POLL_TIME, REQUESTS_TIMEOUT, POLL_INTERVAL, FOLLOW_CLIENT_REDIRECTS);
  keep only LOGIN_POLL_INTERVAL and GRADER_* which are not in the ConfigMap
- Update PR description: image is on DockerHub, not GHCR

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
blarghmatey and others added 2 commits March 20, 2026 14:06
Register the MIT 6.686x course-specific grader image in GRADER_PIPELINES
so the meta pipeline creates a build-graders-mit-686x-image Concourse
pipeline that tracks the graders-mit-686x repo and pushes to ECR at
mitodl/graders-mit-686x.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
blarghmatey and others added 3 commits March 20, 2026 16:04
Add the edxorg-686x queue to the mitxonline production xqwatcher stack
using the ContainerGrader handler, replacing the legacy JailedGrader
configuration in confd_json. This is in preparation for deployment of
the xqueue-watcher changes in mitodl/xqueue-watcher#14.

The memory limit is set to 1Gi (vs 512Mi for 600x) to accommodate the
torch dependency used by the mnist problem set graders.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add an "edxorg" entry to the xqueue_servers.json Vault template so that
queues using SERVER_REF "edxorg" resolve credentials for
https://xqueue.edx.org. The template variables edxorg_xqueue_username
and edxorg_xqueue_password must be added to the existing edx-xqueue
Vault KV secret.

Update the queue config loop to use setdefault so that queues can
declare their own SERVER_REF in the Pulumi stack config rather than
always being assigned "default".

Set SERVER_REF: edxorg on the edxorg-686x queue in the mitxonline
production stack config.
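
The `setdefault` change can be illustrated in a few lines. The queue names and the `CONNECTIONS` values below are placeholders (only `edxorg-686x` and the `SERVER_REF` values appear in this PR); the point is that an explicit `SERVER_REF` survives while unset queues receive `"default"`.

```python
# Placeholder queue definitions standing in for xqwatcher:queues.
queues = {
    "example-600x": {"CONNECTIONS": 1},
    "edxorg-686x": {"CONNECTIONS": 1, "SERVER_REF": "edxorg"},
}

for queue_config in queues.values():
    # Unlike plain assignment, setdefault keeps a value the stack config set.
    queue_config.setdefault("SERVER_REF", "default")
```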

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ofile

Accommodates changes from xqueue-watcher commit 15fdd86
(security: harden containergrader and XQueue client):

- Expose XQWATCHER_VERIFY_TLS via xqwatcher:verify_tls Pulumi config
  (default "true"; set "false" only for dev envs with self-signed certs).
- Expose XQWATCHER_SUBMISSION_SIZE_LIMIT via xqwatcher:submission_size_limit
  Pulumi config (default 1 MB, matching containergrader default).
- Add RuntimeDefault seccomp profile to the xqwatcher pod's
  PodSecurityContextArgs, mirroring the profile now applied to grading
  Jobs in containergrader.py for defence-in-depth consistency.
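
As a rough sketch of two of these settings, using plain dicts rather than the stack's Pulumi argument classes: the helper names are hypothetical, `runAsNonRoot` reflects the non-root context mentioned in the PR summary, and the stack config is modelled as a simple mapping.

```python
def pod_security_context() -> dict:
    # seccompProfile.type "RuntimeDefault" selects the container runtime's
    # default seccomp filter for every container in the pod.
    return {
        "runAsNonRoot": True,
        "seccompProfile": {"type": "RuntimeDefault"},
    }


def verify_tls_env(stack_config: dict) -> dict:
    """Map xqwatcher:verify_tls (default "true") to an env var entry."""
    value = str(stack_config.get("xqwatcher:verify_tls", "true")).lower()
    if value not in ("true", "false"):
        raise ValueError(f"verify_tls must be 'true' or 'false', got {value!r}")
    return {"name": "XQWATCHER_VERIFY_TLS", "value": value}


default_env = verify_tls_env({})
dev_env = verify_tls_env({"xqwatcher:verify_tls": "false"})
context = pod_security_context()
```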

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@blarghmatey blarghmatey requested a review from Copilot March 23, 2026 17:13
@blarghmatey

/gemini review

Copilot AI left a comment

Pull request overview

Copilot reviewed 20 out of 21 changed files in this pull request and generated 16 comments.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request represents a significant refactor, migrating the xqueue-watcher service from an EC2-based deployment to a Kubernetes-based deployment on EKS. Key changes include updating Pulumi configurations to define Kubernetes-specific settings for xqwatcher queues and container grader parameters, and a complete rewrite of the main Pulumi application file to provision Kubernetes Deployments, ConfigMaps, Secrets (via Vault K8s Secrets Operator), and RBAC roles. New Concourse pipelines have been introduced to build and publish xqueue-watcher base and course-specific grader images to DockerHub and ECR, along with a meta-pipeline to manage them. Additionally, a utility function ensure_ecr_task was added for creating ECR repositories, and Vault policies and EKS authentication bindings were updated to support the new Kubernetes deployment model. The review comments highlight the need to use immutable tags for grader images in production environments instead of :latest to ensure predictable deployments, and suggest refactoring the duplicated queues configuration across multiple Pulumi stack files for improved maintainability.

blarghmatey and others added 2 commits March 23, 2026 13:51
- Use ':' separator for tags and '@' only for sha256 digests when
  building the xqwatcher docker_image_ref; rename env var from
  XQWATCHER_DOCKER_DIGEST to XQWATCHER_DOCKER_TAG to match the config
  key name
- Fix misleading comment on ECR base image resource in
  base_image_pipeline.py: downstream grader pipelines trigger off
  DockerHub, not ECR
- Remove automount_service_account_token=False from IRSA ServiceAccount
  created by OLEKSAuthBinding so the projected token is mounted and
  IRSA can authenticate via sts:AssumeRoleWithWebIdentity

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
blarghmatey and others added 3 commits March 24, 2026 12:28
Add a HorizontalPodAutoscaler (autoscaling/v2) targeting the xqwatcher
Deployment, scaling on:
- CPU: 60% average utilization
- Memory: 80% average utilization

Scale-up is aggressive (up to 100% more pods per minute, 60s
stabilization) to handle submission bursts; scale-down is conservative
(≤25% reduction per minute, 5-minute stabilization) to avoid thrashing.

Min/max replica bounds are configurable via xqwatcher:min_replicas and
xqwatcher:max_replicas stack config (defaults: 1 and 5).

The Deployment gains ignore_changes=["spec.replicas"] so Pulumi does
not revert the replica count that the HPA manages between stack updates.

Exports k8s_hpa_name for stack consumers.
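
The scaling parameters above map onto an `autoscaling/v2` manifest roughly as follows. This is a sketch written as a plain Python dict, not the stack's Pulumi resource; the `-hpa` name suffix is an assumption, while the thresholds, windows, and replica defaults come from the commit message.

```python
def xqwatcher_hpa(deployment: str, namespace: str,
                  min_replicas: int = 1, max_replicas: int = 5) -> dict:
    def utilization(resource: str, percent: int) -> dict:
        return {
            "type": "Resource",
            "resource": {
                "name": resource,
                "target": {"type": "Utilization", "averageUtilization": percent},
            },
        }

    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": f"{deployment}-hpa", "namespace": namespace},
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1", "kind": "Deployment", "name": deployment,
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [utilization("cpu", 60), utilization("memory", 80)],
            "behavior": {
                # Aggressive: may double the pod count each minute.
                "scaleUp": {
                    "stabilizationWindowSeconds": 60,
                    "policies": [{"type": "Percent", "value": 100, "periodSeconds": 60}],
                },
                # Conservative: at most 25% fewer pods per minute, 5 min window.
                "scaleDown": {
                    "stabilizationWindowSeconds": 300,
                    "policies": [{"type": "Percent", "value": 25, "periodSeconds": 60}],
                },
            },
        },
    }


hpa = xqwatcher_hpa("xqwatcher", "mitxonline-openedx")
```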

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The original design embedded edxorg credentials in the same
VaultStaticSecret as the MIT-hosted xqueue server, referencing
edxorg_xqueue_username / edxorg_xqueue_password keys that do not exist
in secret-<env>/edx-xqueue (which only holds edxapp_password and
xqwatcher_password).

Instead, create a fully independent Deployment per xqueue server:

- xqwatcher (default): watches queues targeting the MIT-hosted xqueue.
  Reads credentials from secret-<env>/edx-xqueue.  Only queues with
  SERVER_REF="default" (or no SERVER_REF) are included in its ConfigMap.

- xqwatcher-edxorg (optional): watches queues with SERVER_REF="edxorg".
  Reads credentials from a separate secret-<env>/edxorg-xqueue Vault
  path, created only when xqwatcher:edxorg_xqueue_enabled=true.

Each Deployment has its own ConfigMap (scoped to its own queue subset),
VaultStaticSecret, and HPA; they share the xqwatcher ServiceAccount and
RBAC Role since both need identical permissions to manage grading Jobs.
The Vault policy gains a read grant for secret-DEPLOYMENT/edxorg-xqueue.

This eliminates the need for a file-merge initContainer and gives each
server integration independent observability, scaling, and secret access.
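
The per-server split of the queue config can be sketched as a simple partition by `SERVER_REF`. The function and the sample queue names are illustrative (only `edxorg-686x` appears in this PR); queues without a `SERVER_REF` fall into the default group, as described above.

```python
from collections import defaultdict


def partition_by_server(queues: dict) -> dict:
    """Group queue configs by SERVER_REF, treating a missing value as 'default'."""
    groups = defaultdict(dict)
    for name, config in queues.items():
        groups[config.get("SERVER_REF", "default")][name] = config
    return dict(groups)


# Placeholder queue definitions.
queues = {
    "example-600x": {"CONNECTIONS": 1},
    "example-other": {"CONNECTIONS": 1, "SERVER_REF": "default"},
    "edxorg-686x": {"CONNECTIONS": 1, "SERVER_REF": "edxorg"},
}
groups = partition_by_server(queues)

# Each group would back one Deployment's ConfigMap; the edxorg group is
# only used when the optional edxorg Deployment is enabled.
default_queues = sorted(groups.get("default", {}))
edxorg_queues = sorted(groups.get("edxorg", {}))
```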

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@blarghmatey blarghmatey merged commit caf670a into main Mar 24, 2026
7 of 8 checks passed
@blarghmatey blarghmatey deleted the feat/xqwatcher-kubernetes-migration branch March 24, 2026 18:15

3 participants